Genome assembly of the hybrid grapevine Vitis ‘Chambourcin’

‘Chambourcin’ is a French-American interspecific hybrid grape grown in the eastern and midwestern United States and used for making wine. Few genomic resources are available for hybrid grapevines like ‘Chambourcin’. Here, we assembled the genome of ‘Chambourcin’ using PacBio HiFi long-read, Bionano optical map, and Illumina short-read sequencing technologies. We generated an assembly for ‘Chambourcin’ with 26 scaffolds, with an N50 length of 23.3 Mb and an estimated BUSCO completeness of 97.9%. We predicted 33,791 gene models and identified 16,056 common orthologs between ‘Chambourcin’, V. vinifera ‘PN40024’ 12X.v2, VCOST.v3, Shine Muscat and V. riparia Gloire. We found 1,606 plant transcription factors from 58 gene families. Finally, we identified 304,571 simple sequence repeats (up to six base pairs long). Our work provides the genome assembly, annotation and the protein and coding sequences of ‘Chambourcin’. Our genome assembly is a valuable resource for genome comparisons, functional genomic analyses and genome-assisted breeding research.


METHODS
PacBio HiFi, Bionano optical map, and Illumina sequencing
'Chambourcin' leaf material was obtained from a 12-year-old experimental vineyard located at the University of Missouri Southwest Research Station in Mount Vernon, Missouri, USA. For PacBio HiFi sequencing, high molecular weight (HMW) DNA was isolated using the Nucleobond Kit (Macherey-Nagel, Bethlehem, PA, USA) following the manufacturer's protocol. Approximately 20 μg of DNA was sheared to a center of mass of 10-20 kilobases (kb) in a Megaruptor 3 system. Next, a HiFi sequencing library was constructed following HiFi SMRTbell protocols for the Express Template Prep Kit 2.0 according to the manufacturer's recommendations (Pacific Biosciences, California). The library was sequenced using Sequel binding and sequencing chemistry v2.0 in circular consensus sequencing (CCS) mode in a Sequel II system with a movie (the file format of HiFi data) collection time of 30 h. The HiFi reads were generated with the CCS mode of pbtools [12] using a minimum Predicted Accuracy of 0.990.
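As a point of reference for the accuracy threshold above, a predicted accuracy can be expressed on the Phred scale as Q = -10 log10(1 - accuracy), so 0.990 corresponds to roughly Q20. The following minimal Python sketch illustrates the conversion and the threshold check; the function names are hypothetical and are not part of pbtools or any PacBio software.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a predicted read accuracy in [0, 1) to a Phred-scaled quality."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the interval [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def passes_hifi_threshold(accuracy: float, min_accuracy: float = 0.990) -> bool:
    """Return True if a read meets the minimum predicted accuracy.

    The default of 0.990 is the threshold stated in the text; the function
    itself is an illustrative helper, not the CCS implementation.
    """
    return accuracy >= min_accuracy


if __name__ == "__main__":
    print(round(accuracy_to_phred(0.990), 2))
    print(passes_hifi_threshold(0.995))
```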
For the Bionano data, DNA was isolated from fresh young leaf tissue from a 12-year-old experimental vineyard located at the University of Missouri Southwest Research Station in Mount Vernon, Missouri, USA, using the Prep™ Plant DNA Isolation kit and labeled using the Bionano Prep™ DNA Labeling Kit Direct Label and Stain (DLS) (Bionano Genomics, San Diego, California). In total, 500 ng of ultra-high molecular weight (UHMW) DNA was used for the DLS reaction. DNA was incubated in the presence of DLE-1 Enzyme, DL-Green, and DLE-1 Buffer for 3 h 20 min at 37°C. This was followed by proteinase K digestion at 50°C for 30 min and double cleanup of the unincorporated DL-Green label. The resulting DLS sample was combined with the Flow Buffer, dithiothreitol (DTT), and DNA stain, mixed at slow speed in a rotator mixer for 1 h, and then incubated overnight at 4°C.
The labeled sample was then loaded onto a Bionano flow cell in a Saphyr System for separation, imaging, and creation of digital molecules according to the manufacturer's recommendations [13]. The raw molecule set was filtered to a minimum molecule length of 250 kb and a minimum of nine CTTAAG labels per molecule. Bionano maps were assembled without pre-assembly using the non-haplotype parameters with no Complex Multi-Path Regions (CMPR) cut and without extend-split. Bionano software (Solve, Tools and Access, v1.5.1) [14] was used for data visualization, processing, and assembly of Bionano maps.

Genome assembly
The PacBio HiFi assembly was generated using the Hifiasm assembler (RRID:SCR_021069; v0.13-r308) [18] with default parameters. To reduce the number of small, low-coverage artifactual contigs often generated by Hifiasm [18], the assembly was filtered to exclude contigs shorter than 70,000 bp. The resulting HiFi contigs were merged with the DLS Bionano maps using the hybridscaffold.pl script of Bionano Solve (v3.5.1) [14] to produce a hybrid assembly. Each scaffold of the hybrid assembly was then checked, and small overlapping contigs were curated and removed to make a contiguous sequence. This curated diploid assembly was examined to identify alternative contigs using Purge_Haplotigs (v1.1.1; RRID:SCR_017616) [19], and the primary and haplotig assemblies were created. We mapped trimmed Illumina whole genome sequences to both assemblies separately with bowtie2 (v2.3.4; RRID:SCR_016368) [20] and samtools (v1.9; RRID:SCR_002105) [21]. The resulting .bam files were used to polish both assemblies with one round of Pilon (v1.23; RRID:SCR_014731) [22], yielding the final primary and haplotig assemblies. In this study, we used only the primary assembly for all downstream analyses, but the haplotigs are maintained to cover the total heterozygous genome. Scaffolds were aligned to the V. vinifera 'PN40024' 12X.v2 [23] S. Patel et al.
The rhAmpSeq markers were designed to target the core Vitis genome and were developed from gene-rich collinear regions of 10 Vitis genomes [25]. These markers aid in mapping contigs on chromosomes and checking their orientation.
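The contig length filter applied to the Hifiasm output (removal of contigs shorter than 70,000 bp) can be illustrated with a minimal Python sketch. This is not the authors' script: the FASTA parser and the function names are hypothetical, and only the 70,000 bp threshold comes from the text.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(
    records: Iterable[Tuple[str, str]], min_length: int = 70_000
) -> Iterator[Tuple[str, str]]:
    """Keep only contigs whose length is at least min_length bases."""
    for name, seq in records:
        if len(seq) >= min_length:
            yield name, seq


if __name__ == "__main__":
    # Toy input with a small threshold so the example stays readable.
    toy = io.StringIO(">ctg1\nACGTACGT\n>ctg2\nACG\n")
    for name, seq in filter_contigs(read_fasta(toy), min_length=5):
        print(name, len(seq))
```

In practice the same logic would be run with the default threshold on the assembler's contig FASTA file, writing the retained records to a new file before scaffolding.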
The resulting gene predictions (proteins, coding sequences, and annotations) were completed separately for the 'Chambourcin' primary assembly and the 'Chambourcin' haplotig assembly. The quality of the predicted proteins was assessed using BUSCO (v5.4.2) [27] in protein mode with the embryophyta_odb10 dataset. The predicted proteins of the Vitis 'Chambourcin' primary assembly were then functionally annotated using eggNOG-mapper (v2) (RRID:SCR_021165) [36] and related to Gene Ontology (GO), KEGG pathway, and other functional information. The GO plot was developed using the WEGO tool [37]. For the analysis of orthologous gene models, the sequences of 'Chambourcin' primary gene models, V. vinifera PN40024 12X.v2, VCost.v3, Shine Muscat, and V. riparia Gloire were analyzed using OrthoVenn2 [38] with default settings (E-value: 1 × 10⁻⁵; inflation value: 1.5).

Plant transcription factors prediction, phylogenetic tree, and WRKY classification
The plant transcription factors for the gene models of the 'Chambourcin' primary assembly, V. vinifera PN40024 12X.v2, and VCost.v3 were identified using the Plant Transcription Factor Database (PlantTFDB v5.0; RRID:SCR_003362) [39]. The identified transcription factors were divided into subfamilies according to their sequence relationship with V. vinifera. For the circular phylogenetic tree, the WRKY sequences of the 'Chambourcin' primary gene models and of the V. vinifera PN40024 12X.v2 and VCost.v3 gene models were retrieved from PlantTFDB (v5.0) [39] and aligned using the ClustalW method in MEGA7 [40]. A phylogenetic analysis was carried out using the neighbor-joining method with 1,000 bootstrap replications, and the evolutionary distances were computed using the Poisson correction method with the Pairwise Deletion option. The WRKY classification of 'Chambourcin' primary gene models was carried out using the same method described in a previous study [41].
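To make the distance measure concrete, the Poisson correction converts the proportion p of differing amino acid sites between two aligned sequences into a distance d = -ln(1 - p), and pairwise deletion ignores only those alignment columns in which either of the two sequences has a gap or missing data. The Python sketch below illustrates this under stated assumptions: it is not MEGA7 code, the function name is hypothetical, and treating "-" and "?" as the gap and missing-data symbols is an assumption.

```python
import math

# Assumed symbols for alignment gaps and missing data.
GAP_CHARS = {"-", "?"}


def poisson_distance(seq_a: str, seq_b: str) -> float:
    """Poisson-corrected distance between two aligned protein sequences.

    Columns where either sequence has a gap are skipped for this pair only
    (pairwise deletion). Returns infinity if every compared site differs.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # pairwise deletion of this column
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    if differing == 0:
        return 0.0
    p = differing / compared
    if p >= 1.0:
        return math.inf
    return -math.log(1.0 - p)


if __name__ == "__main__":
    # One difference in five comparable sites: p = 0.2.
    print(poisson_distance("MKV-LAG", "MRVQLA-"))
```

A matrix of such pairwise distances is the input that a neighbor-joining procedure would use to build the tree.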

Genome sequencing and assembly of 'Chambourcin'
We generated a high-quality and contiguous genome sequence of 'Chambourcin' using sequenced to date [23,28,29]. Relatively higher levels of heterozygosity in the 'Chambourcin' genome compared to other Vitis species are expected, given the complex interspecific pedigree of this cultivar. A GenomeScope plot of clean reads demonstrated two peaks of coverage; the first peak located at 25X coverage corresponds to the heterozygous portion of the genome, and the second peak at 52X coverage corresponds to the homozygous portion of the genome (Figure 1).
A de novo 'Chambourcin' genome was assembled using HiFi, Bionano, and Illumina data. First, a contig assembly of the PacBio HiFi reads resolved the reads into 196 contigs with an N50 of 12,215,205 bp and a total length of 949,347,381 bp (Table 1). The PacBio HiFi contig assembly was then merged with the Bionano maps to get an initial hybrid assembly comprising 67 scaffolds with an N50 length of 16 Table 1). The 'Chambourcin' primary genome assembly was aligned to the reference genomes V. vinifera 'PN40024' 12X.v2 [23] (see Table 1 in GigaDB [45]), Shine Muscat [28], and V. riparia 'Gloire' [29]. A dot plot was generated to facilitate the comparisons among genomes. Collinearity between 'Chambourcin' and V. vinifera 'PN40024' 12X.v2, Shine Muscat, and V. riparia 'Gloire' was observed as a straight diagonal line without large gaps in the dot plot, confirming the high synteny of the 'Chambourcin' genome with V. vinifera 'PN40024' 12X.v2 (Figure 2A), Shine Muscat (Figure 2B), and V. riparia 'Gloire' (Figure 2C). To further validate the 'Chambourcin' genome assembly, we mapped 'Chambourcin' rhAmpSeq markers [25] to the 'Chambourcin' genome assembly. We found that 99% of rhAmpSeq markers mapped to 'Chambourcin' scaffolds, on the same chromosomes and at the same positions from which the markers were derived in the collinear Vitis core genome (see Table 2 in GigaDB [45]).
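The N50 values reported above follow the usual definition: when sequences are sorted from longest to shortest, N50 is the length of the sequence at which the running total first reaches at least half of the total assembly length. A minimal Python sketch with toy lengths (not the 'Chambourcin' data; the function name is hypothetical):

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids rounding issues with total / 2.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running total always reaches the total")


if __name__ == "__main__":
    # Total is 20; the two longest sequences (6 + 5 = 11) cover half of it.
    print(n50([2, 3, 4, 5, 6]))  # prints 5
```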

Genome assembly analysis using K-mer spectra plot
The genome quality was assessed with Illumina whole genome reads separately for the diploid, primary, and haplotig genome assemblies using the KAT tool [31], and K-mer spectra plots were generated. A K-mer spectrum is a graphical representation showing how many k-mers appear a certain number of times: the frequency of occurrence is plotted on the x-axis and the number of k-mers on the y-axis. All K-mer spectra plots for the 'Chambourcin' diploid, primary, and haplotig assemblies showed an error distribution under 10X, a heterozygous peak at 35X, and a homozygous peak at 67X (Figure 3A-C). The colors in the K-mer spectra plots show how many times the k-mers occur in the assembly. Black represents read content that is absent from the assembly (0X), red represents content that occurs once (1X), purple represents content that occurs twice (2X), and green represents content that occurs three times (3X). In the K-mer spectra plot of the diploid genome assembly, the read content shown in black is absent from the assembly, and the red peak, representing content that occurs once, covers most of the heterozygous content. At the same time, the purple peak indicates more duplication of the homozygous content (Figure 3A).
The K-mer spectra plot of the primary genome is more collapsed, including mostly a single copy of the homozygous content and less of the heterozygous content (Figure 3B). The K-mer spectra plot for the haplotig genome assembly identified two black peaks representing read content in both the heterozygous and homozygous regions (Figure 3C).
These K-mer spectra plots provide useful information for assessing a genome assembly with whole genome short reads, both to identify duplicated regions in the assembly and to visualize its content. This visualization is useful during genome assembly curation for identifying accurate primary and haplotig assemblies from a diploid genome assembly.
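The idea behind these plots can be sketched in a few lines of Python: count k-mers in the reads, look up how many times each read k-mer occurs in the assembly, and tabulate the number of distinct k-mers per read multiplicity for each assembly copy number (0X, 1X, 2X, and so on). This is a simplified illustration, not the KAT implementation: the function names are hypothetical, reverse complements are not merged, and the toy sequences are invented for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def count_kmers(sequences: Iterable[str], k: int) -> Counter:
    """Count every k-mer in the given sequences, skipping k-mers with N."""
    counts: Counter = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def spectra_by_copy_number(
    read_counts: Counter, assembly_counts: Counter
) -> Dict[int, Counter]:
    """Group read k-mers by their copy number in the assembly.

    Returns {assembly copy number: Counter{read multiplicity: distinct k-mers}}.
    Copy number 0 corresponds to read content absent from the assembly.
    """
    spectra: Dict[int, Counter] = defaultdict(Counter)
    for kmer, multiplicity in read_counts.items():
        copies = assembly_counts.get(kmer, 0)
        spectra[copies][multiplicity] += 1
    return spectra


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "GGGTT"]  # toy reads
    assembly = ["ACGTA"]                 # toy assembly
    spectra = spectra_by_copy_number(
        count_kmers(reads, k=3), count_kmers(assembly, k=3)
    )
    for copies in sorted(spectra):
        print(copies, dict(spectra[copies]))
```

Plotting each copy-number class as a separate colored series, with read multiplicity on the x-axis and the number of distinct k-mers on the y-axis, gives the kind of stacked spectrum described in this section.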

Gene annotation and orthologous genes
A total of 33,791 gene models were predicted for the 'Chambourcin' primary genome assembly (Figure 4). We identified 94.6% complete BUSCOs (C); of these, 86.9% were designated single-copy BUSCOs (S), and 7.7% were designated duplicated BUSCOs (D) (Table 3). As evidenced by the high number of complete single-copy genes identified, the BUSCO results indicate that the 'Chambourcin' primary genome assembly offers     gene models were identified with GO accessions and further classified into three sub-ontologies: biological process (11,399), cellular component (11,472), and molecular function (9,977) (Figure 5F) (GigaDB Table 4 [45]). A total of 8,460 'Chambourcin' primary proteins were annotated with KEGG pathways (GigaDB Table 4 [45]). Using OrthoVenn2, we

Mapping of Illumina whole genome reads, and RNA-seq reads to the genome assembly
We aligned trimmed Illumina whole genome reads to the 'Chambourcin' diploid, primary, and haplotig assemblies separately and obtained average mapping rates of 95.08%, 90.17%, and 75.88%, respectively (Table 5). We also separately mapped trimmed RNA-seq reads to the 'Chambourcin' diploid, primary, and haplotig assemblies and obtained mapping rates of 87%, 80.81%, and 60.77%, respectively. For both the trimmed Illumina whole genome reads and the trimmed RNA-seq reads, the largest proportion of reads mapped to the diploid assembly, followed by the primary and haplotig assemblies. The primary genome assembly retained most of the genome from the diploid assembly, while the smallest number of mapped reads belonged to the haplotig assembly. These results suggest that the haplotig assembly represents the smallest portion of the genome.

CONCLUSION
In this study, we presented the genome assembly of 'Chambourcin', a complex interspecific hybrid grape cultivar, using PacBio HiFi long-read sequencing, Bionano optical maps, and Illumina short-read data. The comparative genomic analyses of 'Chambourcin' with the reference genomes of V. vinifera 'PN40024' 12X.v2, Shine Muscat, and V. riparia 'Gloire' indicated that the 'Chambourcin' genome aligns well with other grape genomes without any large structural variation. Ortholog analyses of the 'Chambourcin' primary gene models, V. vinifera 'PN40024' 12X.v2, VCost.v3, Shine Muscat, and V. riparia 'Gloire' revealed that our 'Chambourcin' genome assembly and gene annotations are a high-quality grapevine resource for the research community. Interspecific hybrids derived from two or more Vitis species are common in nature [1]. They are the cornerstone of grapevine rootstocks grown worldwide, cultivars that predominate in eastern and midwestern North America, and new disease-resistant genotypes currently in development [46]. The sequencing data, scaffold assemblies, and gene annotations of the 'Chambourcin' genome assembly described here provide a valuable resource for genome comparisons, functional genomic analyses, and genome-assisted breeding research.

DATA AVAILABILITY
The PacBio HiFi and Illumina whole genome reads are deposited in the NCBI BioProject with accession PRJNA754438. The Sequence Read Archive (SRA) accession of the PacBio HiFi reads is SRR15530464, and the SRA accessions of the Illumina whole genome reads are SRR24093946, SRR24093988, SRR24095403, and SRR24097763. The Bionano maps, genome assembly, gene annotation, proteins, and other data are available on figshare [47]. Supplementary tables and additional data are in the GigaDB repository [45].